Modeling Comma Placement in Chinese Text for Better Readability using Linguistic Features and Gaze Information

نویسندگان

  • Tadayoshi Hara
  • Chen Chen
  • Yoshinobu Kano
  • Akiko Aizawa
چکیده

Comma placements in Chinese text are relatively arbitrary although there are some syntactic guidelines for them. In this research, we attempt to improve the readability of text by optimizing comma placements through integration of linguistic features of text and gaze features of readers. We design a comma predictor for general Chinese text based on conditional random field models with linguistic features. After that, we build a rule-based filter for categorizing commas in text according to their contribution to readability based on the analysis of gazes of people reading text with and without commas. The experimental results show that our predictor reproduces the comma distribution in the Penn Chinese Treebank with 78.41 in F1-score and commas chosen by our filter smoothen certain gaze behaviors.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Automatic Readability Classification of Crowd-Sourced Data based on Linguistic and Information-Theoretic Features

This paper presents a classifier of text readability based on information-theoretic features. The classifier was developed based on a linguistic approach to readability that explores lexical, syntactic and semantic features. For this evaluation we extracted a corpus of 645 articles from Wikipedia together with their quality judgments. We show that information-theoretic features perform as well ...

متن کامل

Automatic Assessment of Information Disclosure Quality in Chinese Annual Reports

Information disclosure in annual reports is a mandatory requirement for publicly traded companies in China. The quality of information disclosure will reduce information asymmetry and therefore support market efficiency. Currently, the evaluation of the information disclosure quality in Chinese reports is conducted manually. It remains an untapped field for NLP and text mining community. The go...

متن کامل

可讀性預測於中小學國語文教科書及優良課外讀物之研究(A Study of Readability Prediction on Elementary and Secondary Chinese Textbooks and Excellent Extracurricular Reading Materials) [In Chinese]

Readability is basically concerned with readers’ comprehension of given textual materials: the higher the readability of a document, the easier the document can be understood. It may be affected by various factors, such as document length, word difficulty, sentence structure and whether the content of a document meets the prior knowledge of a reader or not. However, simple surface linguistic fe...

متن کامل

All Mixed Up? Finding the Optimal Feature Set for General Readability Prediction and Its Application to English and Dutch

Readability research has a long and rich tradition, but there has been too little focus on general readability prediction without targeting a specific audience or text genre. Moreover, although NLP-inspired research has focused on adding more complex readability features, there is still no consensus on which features contribute most to the prediction. In this article, we investigate in close de...

متن کامل

Pause and Stop Labeling for Chinese Sentence Boundary Detection

The fuzziness of Chinese sentence boundary makes discourse analysis more challenging. Moreover, many articles posted on the Internet are even lack of punctuation marks. In this paper, we collect documents written by masters as a reference corpus and propose a model to label the punctuation marks for the given text. Conditional random field (CRF) models trained with the corpus determine the corr...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2013